-
Exposure to two languages at a young age can benefit bilingual children's language and literacy development. Speech technology can aid in developing tools that accurately quantify children's exposure to multiple languages, helping parents, teachers, and early-childhood practitioners better support bilingual children. This study lays the foundation toward this goal using the Hoff corpus of naturalistic adult-child bilingual interactions collected at child ages 2½, 3, and 3½ years. Exploiting self-supervised learning (SSL) features from XLSR-53 and HuBERT, we jointly predict the language (English/Spanish) and speaker (adult/child) of each utterance using a multi-task learning approach. Our experiments indicate that a trainable linear combination of embeddings across all Transformer layers of the SSL models is a stronger indicator for both tasks than any single layer, with greater benefit to speaker classification. Language classification for children, however, remains challenging.
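The trainable linear combination of layer embeddings described above can be sketched as a softmax-weighted sum over per-layer SSL outputs. This is a minimal illustrative sketch, not the paper's implementation: the function names are hypothetical, and in practice the layer logits would be learned jointly with the classifier heads by backpropagation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def combine_layers(layer_embeddings, layer_logits):
    """Weighted sum over Transformer layers.

    layer_embeddings: one vector (list of floats) per SSL layer.
    layer_logits: one trainable scalar per layer; softmax turns them
    into normalized mixing weights.
    """
    weights = softmax(layer_logits)
    dim = len(layer_embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, layer_embeddings))
            for d in range(dim)]
```

With equal logits, the combination reduces to a plain layer average; training shifts the weights toward the layers most useful for each task.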
-
Speaker diarization has traditionally been explored using datasets that are either clean, feature a limited number of speakers, or contain large volumes of data that lack the complexities of real-world scenarios. This study instead focuses on the Fearless Steps APOLLO audio resource, a challenging corpus of over 70,000 hours of audio (A-11: 10k hrs), the majority of which remains unlabeled. The corpus presents considerable challenges: diverse acoustic conditions, high levels of background noise, overlapping speech, data imbalance, and a variable number of speakers with varying utterance durations. To address these challenges, we propose a robust speaker diarization framework built on a dynamic Graph Attention Network and optimized using data augmentation. The proposed framework attains a Diarization Error Rate (DER) of 19.6% when evaluated on ground-truth speech segments. Notably, this work is the first to recognize, track, and perform conversational analysis across the entire Apollo-11 mission for speakers who were previously unidentified. It stands as a significant contribution both to historical archiving and to the development of robust diarization systems for challenging real-world scenarios.
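The graph-attention mechanism at the core of such a framework can be illustrated with a single-head layer in the style of Velickovic et al.: each node (e.g., a speaker-segment embedding) attends over its graph neighbours. This is a generic sketch under assumed shapes, not the paper's exact dynamic network; all parameter names (`W`, `a`, `adj`) are illustrative.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x >= 0 else slope * x

def gat_layer(h, adj, W, a):
    """One single-head graph attention layer over N node embeddings.

    h: N x F node features; adj: N x N binary adjacency (self-loops
    included); W: F x F' projection matrix; a: length-2F' attention vector.
    """
    N, Fp = len(h), len(W[0])
    # linear projection z_i = W^T h_i
    z = [[sum(h[i][k] * W[k][j] for k in range(len(W))) for j in range(Fp)]
         for i in range(N)]
    out = []
    for i in range(N):
        nbrs = [j for j in range(N) if adj[i][j]]
        # attention logits e_ij = LeakyReLU(a . [z_i || z_j])
        e = [leaky_relu(sum(ak * zk for ak, zk in zip(a, z[i] + z[j])))
             for j in nbrs]
        m = max(e)
        exps = [math.exp(x - m) for x in e]
        s = sum(exps)
        alphas = [x / s for x in exps]
        # aggregate neighbours with normalized attention weights
        out.append([sum(al * z[j][d] for al, j in zip(alphas, nbrs))
                    for d in range(Fp)])
    return out
```

With a zero attention vector the weights are uniform and the layer reduces to neighbourhood averaging; training the attention parameters lets the network emphasize segments likely to share a speaker.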
-
Dual-Path Minimum-Phase and All-Pass Decomposition Network for Single-Channel Speech Dereverberation

With the development of deep neural networks (DNNs), many DNN-based speech dereverberation approaches have achieved significant improvement over traditional methods. However, most deep learning-based dereverberation methods focus solely on suppressing reverberation in the time-frequency domain without exploiting cepstral-domain features, which are potentially useful for dereverberation. In this paper, we propose a dual-path neural network that separately processes the minimum-phase and all-pass components of single-channel speech. First, we decompose the speech signal into minimum-phase and all-pass components in the cepstral domain; then a Conformer-embedded U-Net removes reverberation from each component. Finally, the two processed components are combined to synthesize the enhanced output. The proposed method is evaluated on the REVERB Challenge evaluation dataset using commonly adopted objective metrics. Experimental results demonstrate that our method outperforms the compared methods.
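The classical homomorphic decomposition behind this approach can be sketched from the real cepstrum: folding the even cepstrum of log|X| yields the minimum-phase spectrum, and dividing it out leaves a unit-magnitude all-pass residual. This is a frame-level textbook sketch (naive DFT for self-containment), not the paper's network front-end.

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def min_phase_allpass(x):
    """Decompose a frame x into minimum-phase and all-pass spectra
    via the real cepstrum."""
    n = len(x)
    X = dft(x)
    # real cepstrum of the magnitude spectrum (real and even)
    c = [v.real for v in idft([cmath.log(abs(Xk)) for Xk in X])]
    # fold: double positive quefrencies to form the minimum-phase cepstrum
    w = [1.0] + [2.0] * ((n - 1) // 2) + ([1.0] if n % 2 == 0 else [])
    w += [0.0] * (n - len(w))
    C_min = dft([wi * ci for wi, ci in zip(w, c)])
    X_min = [cmath.exp(v) for v in C_min]      # |X_min| = |X|
    X_ap = [Xk / Mk for Xk, Mk in zip(X, X_min)]  # unit magnitude, residual phase
    return X_min, X_ap
```

Multiplying the two components back together recovers the original spectrum exactly, which is what lets the enhanced components be recombined into the output signal.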
-
Speech and language development are early indicators of overall analytical and learning ability in children. The preschool classroom is a rich language environment for monitoring and ensuring growth in young children by measuring their vocal interactions with teachers and classmates. Early-childhood researchers are naturally interested in analyzing naturalistic rather than controlled lab recordings to measure both the quality and quantity of such interactions. Unfortunately, present-day speech technologies cannot address the wide dynamic range of scenarios in early-childhood classroom settings. Given the diversity of acoustic events and conditions in such daylong audio streams, automated speaker diarization technology must be advanced to handle this challenging domain, both for segmenting audio and for information extraction. This study investigates lightweight, knowledge-distilled, deep learning-based diarization solutions for segmenting classroom interactions of 3- to 5-year-old children with teachers. The focus is on speech-type diarization, which classifies speech segments as coming from either adults or children, partitioned across multiple classrooms. Our lightest CNN model achieves a best F1-score of ∼76.0% on data from two classrooms, based on the dev and test sets of each classroom. It is combined with automatic speech recognition-based re-segmentation modules to perform child-adult diarization. Additionally, F1-scores are obtained for individual segments with corresponding speaker tags (e.g., adult vs. child), giving educators insight into child engagement through naturalistic communication. The study demonstrates the prospects of addressing educational assessment needs through analysis of communication audio streams while maintaining the security and privacy of all children and adults. The resulting child communication metrics have been used to provide broad-based feedback to teachers with the help of visualizations.
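The segment-level F1-score used to evaluate child/adult tagging is the harmonic mean of precision and recall for one class. A minimal sketch over per-segment tag lists (the tag names are illustrative):

```python
def f1_score(ref_tags, hyp_tags, target="child"):
    """Per-class F1 over aligned segment tags.

    ref_tags / hyp_tags: reference and hypothesis labels per segment.
    """
    tp = sum(1 for r, h in zip(ref_tags, hyp_tags) if r == h == target)
    fp = sum(1 for r, h in zip(ref_tags, hyp_tags) if h == target and r != target)
    fn = sum(1 for r, h in zip(ref_tags, hyp_tags) if r == target and h != target)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

An overall score such as the ∼76.0% above would be computed across all segments of a classroom's dev and test sets.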
-
In this paper, we present MixRep, a simple and effective data augmentation strategy based on mixup for low-resource ASR. MixRep interpolates the feature dimensions of hidden representations in the neural network and can be applied to both the acoustic feature input and the output of each layer, generalizing the previous MixSpeech method. Further, we propose combining the mixup with regularization along the time axis of the input, which is shown to be complementary. We apply MixRep to a Conformer encoder of an E2E LAS architecture trained with a joint CTC loss. We experiment on the WSJ dataset and subsets of the SWB dataset, covering read and conversational telephony speech. Experimental results show that MixRep consistently outperforms other regularization methods for low-resource ASR. Compared to a strong SpecAugment baseline, MixRep achieves 6.5% and 6.7% relative WER reductions on the eval92 set and the Callhome portion of the eval2000 set, respectively.
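The core mixup interpolation over feature dimensions can be sketched in a few lines; this is a generic illustration of the operation, not the paper's training code. In training, the coefficient lam would be drawn from a Beta distribution and the loss mixed with the same coefficient, L = lam * loss(y_a) + (1 - lam) * loss(y_b).

```python
def mixrep(h_a, h_b, lam):
    """Feature-wise mixup of two hidden representations:
    h_mix = lam * h_a + (1 - lam) * h_b.

    h_a, h_b: same-length feature vectors from two utterances in a batch;
    lam: mixing coefficient in [0, 1].
    """
    return [lam * a + (1 - lam) * b for a, b in zip(h_a, h_b)]
```

Because the operation is just a convex combination, it can be inserted at the acoustic-feature input or after any encoder layer without changing tensor shapes, which is what lets MixRep generalize input-level MixSpeech.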